(1) Conduct an exploratory data analysis (EDA) of the TRAIN_SET.CSV. Provide an overview of the data and any underlying patterns you may identify. Without a thorough data dictionary, you may have to make some assumptions about the data. Document any transformations you perform. (10 points)

*A good place to start would be col 27 NEXT_INSPECTION_GRADE_C_OR_BELOW since it is the variable we are trying to predict in the following question.

I need to look into the unique values, and each column in general, starting with col 27.
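
A minimal sketch of that column-by-column check, assuming the data is loaded into a pandas DataFrame. The helper name `profile_columns`, the `CITY` column name, and the demo values are made up for illustration; only `NEXT_INSPECTION_GRADE_C_OR_BELOW` comes from the assignment.

```python
import pandas as pd

def profile_columns(df: pd.DataFrame, top_n: int = 5) -> pd.DataFrame:
    """Summarise dtype, missing count, cardinality and top values per column."""
    rows = []
    for col in df.columns:
        counts = df[col].value_counts(dropna=True)  # sorted most frequent first
        rows.append({
            "column": col,
            "dtype": str(df[col].dtype),
            "n_missing": int(df[col].isna().sum()),
            "n_unique": int(df[col].nunique(dropna=True)),
            "top_values": counts.head(top_n).index.tolist(),
        })
    return pd.DataFrame(rows).set_index("column")

# Tiny made-up frame; in the notebook this would be the frame read from TRAIN_SET.CSV
demo = pd.DataFrame({
    "CITY": ["Las Vegas", "Las Vegas", "Henderson", None],
    "NEXT_INSPECTION_GRADE_C_OR_BELOW": ["0", "1", "0", "0"],
})
summary = profile_columns(demo)
print(summary)
```

Reading the `top_values` list next to `n_unique` is usually enough to spot stray labels, mixed types, or unexpected categories before cleaning a field.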

I am really only interested in Las Vegas restaurants, since this is a Las Vegas-related dataset.

SLOWLY BUT SURELY MAKE MY WAY DOWN EVERY FIELD AND CHECK FOR IRREGULARITIES

MOVING ON TO THE NEXT COL

ONTO THE NEXT FIELD

NEXT FIELD, EMPLOYEE_COUNT

NEXT FIELD, MEDIAN_EMPLOYEE_AGE

NEXT FIELD, MEDIAN_EMPLOYEE_TENURE
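
For the three employee fields above, one possible irregularity check is an IQR fence. This is only a sketch: it assumes the fields are numeric, the helper `iqr_outlier_indices` is hypothetical, and the sample values are invented rather than taken from the dataset.

```python
from statistics import quantiles

def iqr_outlier_indices(values, k=1.5):
    """Return positions of values outside [Q1 - k*IQR, Q3 + k*IQR]; None is skipped."""
    clean = [v for v in values if v is not None]
    q1, _, q3 = quantiles(clean, n=4)  # quartile cut points
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [
        i for i, v in enumerate(values)
        if v is not None and not (low <= v <= high)
    ]

# Invented values standing in for a column such as EMPLOYEE_COUNT
employee_count = [12, 15, 9, 14, 11, 13, 10, 902, None, -3]
flagged = iqr_outlier_indices(employee_count)
print(flagged)
```

The fence only flags candidates; whether a flagged row is dropped, capped, or kept is a separate decision that should be documented.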

NEXT FIELD, INSPECTION_TIME
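
A sketch of one way to check INSPECTION_TIME, assuming it arrives as text timestamps. The sample values are invented; `pd.to_datetime(..., errors="coerce")` turns anything unparseable into `NaT` so the bad rows can be listed instead of raising.

```python
import pandas as pd

# Invented sample values; the real column format is an assumption
raw_times = pd.Series(
    ["2013-04-05 11:30:00", "not a time", None], name="INSPECTION_TIME"
)

parsed = pd.to_datetime(raw_times, errors="coerce")

# Rows that had a value but could not be parsed (as opposed to already missing)
bad_rows = raw_times[parsed.isna() & raw_times.notna()]

# Derived feature: hour of day of the inspection
hours = parsed.dt.hour
print(bad_rows)
```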

NEXT FIELD, INSPECTION_DEMERITS

NEXT FIELD, VIOLATIONS_RAW

NEXT FIELD, RECORD_UPDATED

NEXT FIELD, LAT_LONG_RAW

seems fine after accounting for state and city outliers
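
A sketch of how the LAT_LONG_RAW check could be made explicit. The raw format `"(lat, long)"` is an assumption, as are the helper names and the rough bounding box around the Las Vegas valley; adjust all three to what the column actually contains.

```python
import re

# Assumed raw format: "(36.12, -115.17)", parentheses optional
_PAIR = re.compile(
    r"^\s*\(?\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)?\s*$"
)

def parse_lat_long(raw):
    """Return (lat, lon) as floats, or None if raw is not a valid coordinate pair."""
    if not isinstance(raw, str):
        return None
    match = _PAIR.match(raw)
    if match is None:
        return None
    lat, lon = float(match.group(1)), float(match.group(2))
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return None
    return lat, lon

def in_box(point, lat_range=(35.8, 36.5), lon_range=(-115.5, -114.8)):
    """True when point falls inside a rough, assumed box around Las Vegas."""
    if point is None:
        return False
    lat, lon = point
    return lat_range[0] <= lat <= lat_range[1] and lon_range[0] <= lon <= lon_range[1]

# Invented sample values
samples = ["(36.12, -115.17)", "40.71,-74.00", "(999, 0)", "n/a", None]
parsed = [parse_lat_long(s) for s in samples]
near_vegas = [in_box(p) for p in parsed]
print(list(zip(samples, near_vegas)))
```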

FIRST_VIOLATION + SECOND_VIOLATION + THIRD_VIOLATION

NEXT FIELD, FIRST_VIOLATION_TYPE + SECOND_VIOLATION_TYPE + THIRD_VIOLATION_TYPE

NEXT FIELD, NUMBER_OF_VIOLATIONS

NEXT FIELD, NEXT_INSPECTION_GRADE_C_OR_BELOW
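
Since this is the response variable, it should end up strictly binary. A minimal sketch of that check follows; the helper `coerce_target` is hypothetical and the raw values are invented examples of the kind of irregularity to look for, not values observed in the data.

```python
def coerce_target(value):
    """Map a raw target value to 0 or 1, or None when it is not a clean binary flag."""
    if value is None:
        return None
    try:
        number = float(str(value).strip())
    except ValueError:
        return None
    # Only exact 0 and 1 are accepted; NaN and anything else falls through
    return int(number) if number in (0.0, 1.0) else None

# Invented raw values
raw_target = ["0", "1", " 1 ", "Goat", "-3", None, 1.0, "7"]
clean_target = [coerce_target(v) for v in raw_target]

# Positions that need a decision: drop the row or investigate the source
bad_positions = [i for i, v in enumerate(clean_target) if v is None]
print(bad_positions)
```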

KEEPING IN MIND INDEX 1622, 546, 4885

END OF PRELIMINARY CLEANING, NOW OFF TO EDA

City Restaurant Type Analysis

(2) Build a simple model that predicts the outcome of a restaurant’s next inspection, using NEXT_INSPECTION_GRADE_C_OR_BELOW as the response. Use your knowledge of the data and your own statistical expertise to develop an appropriate model. Document your thought process, the model techniques you considered, and the evaluation of the trained model. (5 points)*

(3) Assume now that your model was deployed into production. The business partner is concerned because it appears to be underperforming on new observations. She has provided a production data extract for you to analyze: PRODUCTION_SAMPLE.csv. Perform a model drift analysis to help explain the performance discrepancy. Drift analysis identifies possible differences in the training and production datasets. (35 points)

• What techniques or metrics did you use to identify potential drift?
• What columns have changed and in what ways?
• How would you assess the overall impact of the affected columns on the model performance?
• How would you address your findings with the business partner? What would you recommend?
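
Two common candidates for the first question are the Population Stability Index (PSI) and the two-sample Kolmogorov-Smirnov (KS) statistic. The sketch below is a dependency-free illustration for one numeric column, not necessarily the method used in this notebook; the function names and the toy data are made up.

```python
import math
from bisect import bisect_right

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index, binning on quantiles of the reference sample."""
    ref = sorted(expected)
    # Interior cut points at the 1/n_bins ... (n_bins-1)/n_bins quantiles
    edges = sorted({ref[int(len(ref) * i / n_bins)] for i in range(1, n_bins)})

    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[bisect_right(edges, v)] += 1
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )

# Toy data: "production" is the training sample shifted upward by 30
train = list(range(100))
prod_shifted = [x + 30 for x in train]

psi_same = psi(train, train)
psi_shift = psi(train, prod_shifted)
ks_shift = ks_statistic(train, prod_shifted)
print(psi_same, round(psi_shift, 3), round(ks_shift, 3))
```

A commonly quoted rule of thumb reads PSI below 0.1 as stable and above 0.25 as a substantial shift, but these cut-offs are conventions rather than tests. Categorical columns need a different treatment, for example comparing category shares or checking for labels that never appeared in training.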

https://arxiv.org/pdf/2004.05785.pdf

I am going to be performing the same cleaning steps to the production dataset
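
One way to guarantee the steps really are the same is to wrap them in a single function and call it on both extracts. The sketch below is illustrative only: the function name, the `CITY` column name, and the specific steps are assumptions standing in for the actual cleaning done above, and the frames are invented.

```python
import pandas as pd

def clean_inspections(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleaning steps, applied identically to any extract."""
    out = df.copy()
    # Normalise city spelling, then coerce demerits to numeric (bad values -> NaN)
    out["CITY"] = out["CITY"].str.strip().str.title()
    out["INSPECTION_DEMERITS"] = pd.to_numeric(
        out["INSPECTION_DEMERITS"], errors="coerce"
    )
    keep = (out["CITY"] == "Las Vegas") & out["INSPECTION_DEMERITS"].notna()
    return out.loc[keep].reset_index(drop=True)

# Invented frames standing in for TRAIN_SET.CSV and PRODUCTION_SAMPLE.csv
train_raw = pd.DataFrame({
    "CITY": [" las vegas", "Las Vegas", "New York"],
    "INSPECTION_DEMERITS": ["10", "abc", "3"],
})
prod_raw = pd.DataFrame({
    "CITY": ["LAS VEGAS ", None],
    "INSPECTION_DEMERITS": [7, 5],
})

train_clean = clean_inspections(train_raw)
prod_clean = clean_inspections(prod_raw)
print(len(train_clean), len(prod_clean))
```

Keeping the row counts before and after cleaning for each extract is useful here, because a very different drop rate between training and production is itself a drift signal.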

Start by cleaning the city variable, because I will be deleting every column before this field.

Looking at the table, I think 1 may mean imminent health hazard, and I am proceeding as is.

CLEANING DONE

Need to delete columns that probably won't change over time unless the business closes.

REVISITING QUESTION 2

Bonus Regression Part 2

REVISIT PART 3 time permitting

My computer cannot load this dashboard, but it can output the regression profile. I have commented out the code that crashes my computer when run.